Exploring The Simpsons
Introduction
The Simpsons is the world’s longest-running animated sitcom, and it was created by the American writer Matt Groening. It is a satirical depiction of the working-class life of the Simpson family, which consists of Homer, Marge, Bart, Lisa, and Maggie. The show is set in the fictional town of Springfield, and it parodies culture, society, and the human condition.
This report presents an exploratory analysis and a text analysis of several data sets about the show, available on the data science platform Kaggle and in the #tidytuesday GitHub repository. Let us start by loading the required packages.
library(tidyverse)
library(readr)
library(kableExtra)
library(treemapify)
library(gridExtra)
library(ggpubr)
library(tidytext)
library(RColorBrewer)
library(tm)
library(topicmodels)
library(scales)
library(treemap)
library(igraph)
library(ggraph)
library(ggwordcloud)
library(udpipe)
library(ggcorrplot)
library(GGally)
theme_set(theme_bw())
show_table <- function(x, caption ="", head = 50, scroll = FALSE, full.width = FALSE,
digits = 2, col.names = NA, align = NULL){
table <- x %>%
head(head) %>%
kable(caption = caption, digits = digits, col.names = col.names, align = align,
format.args = list(decimal.mark = ".", big.mark = ",")) %>%
kable_styling("striped", position = "left", full_width = full.width)
if(scroll){
table <- table %>%
scroll_box(width = "100%", height = "500px")
}
return(table)
}
firstup <- function(x) {
substr(x, 1, 1) <- toupper(substr(x, 1, 1))
x
}
colors <- c('#fed41d', '#0094c7', '#f14e28')
palette <- c("#acba81ff", "#2a9430ff", "#ae6b1bff","#024CF0", "#28536bff", "#68aedeff", "#8928e8ff",
"#f25d30ff", "#d63d2aff", "orchid", "#b49ba0ff", "darkorange", "#ef4d8bff", "#ffc510ff",
"lightsalmon1", "azure4", "aquamarine3")
The five data sets we will be working with are simpsons_characters.csv, simpsons_locations.csv, simpsons_script_lines.csv, simpsons_episodes.csv, and simpsons-guests.csv.
characters <- read_delim("Data/simpsons_characters.csv", delim = ",")
locations <- read_delim("Data/simpsons_locations.csv", delim = ",")
episodes <- read_delim("Data/simpsons_episodes.csv", delim = ",")
dialogues <- read_delim("Data/simpsons_script_lines.csv", delim = ",")
guests <- read_delim("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-08-27/simpsons-guests.csv", delim = "|", quote = "")
Pre-processing
Before starting with the actual analysis, it is convenient to perform the following pre-processing operations.
characters data set: the levels of gender are recoded as Male, Female, and Unknown.
dialogues data set: we retain only the speaking lines and add a row-number identifier.
episodes data set: we create a variable denoting the episode number, in the format described here (i.e., the first number is the order in which the episode aired in the entire series, and the second number combines the season with the episode number within that season). We also keep only the episodes up to season 27, since the information about season 28 is partial (there are only 4 episodes).
guests data set: the Movie rows are excluded, and a logical variable self is created, which is TRUE if a guest star played themselves in a given episode and FALSE if they voiced another character instead. Some specific character names are also recoded.
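As a quick illustration of the episode-number format described above, the following base R sketch builds the identifier for a single episode. The helper name `make_episode_number` is hypothetical and is not part of the original analysis; the input values (series order 10, season 1, episode 10) correspond to "Homer's Night Out" in the episodes table.

```r
# Hypothetical helper (illustration only): builds the identifier stored in
# the `number` column from the series order, season, and episode in season.
make_episode_number <- function(id, season, number_in_season) {
  part1 <- sprintf("%03d", id)                                # series order, zero-padded to 3 digits
  part2 <- paste0(season, sprintf("%02d", number_in_season))  # season followed by 2-digit episode number
  paste(part1, part2, sep = "\u2013")                         # joined with an en dash, as in the data
}

make_episode_number(10, 1, 10)   # "010–110"
```

The en dash separator matters because the same format is later used as a join key between the episodes and guests data sets.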
characters <- characters %>%
mutate(gender = fct_explicit_na(gender, na_level = "Unknown"),
gender = fct_recode(gender, Male = "m", Female = "f"))
dialogues <- dialogues %>%
filter(speaking_line) %>%
mutate(line_number = row_number()) %>%
select(line_number, episode_id, number, role = raw_character_text,
location = raw_location_text, line = spoken_words)
episodes <- episodes %>%
na.omit() %>%
mutate(part1 = sprintf("%03d", id),
number_in_season = sprintf("%02d", number_in_season)) %>%
unite("part2", c(season, number_in_season), sep = "", remove = FALSE) %>%
unite("number", c(part1, part2), sep = "–", remove = FALSE) %>%
filter(season <= 27) %>%
select(season, number, episode_id = number_in_series, prod_code = production_code,
year = original_air_year, title, rating = imdb_rating,
votes = imdb_votes, views, us_views = us_viewers_in_millions)
guests <- guests %>%
filter(!season %in% "Movie") %>%
mutate(season = parse_number(season)) %>%
separate_rows(role, sep = ";\\s+") %>%
mutate(self = str_detect(role, "self|selves"),
role = ifelse(role == "Edna Krabappel", "Edna Krabappel-Flanders", role),
role = ifelse(role == "Fit Tony", "Fat Tony", role)) %>%
rename(title = episode_title, prod_code = production_code)
We can now look at the pre-processed data sets characters, locations, episodes, dialogues, and guests.
Characters
| id | name | normalized_name | gender |
|---|---|---|---|
| 7 | Children | children | Unknown |
| 12 | Mechanical Santa | mechanical santa | Unknown |
| 13 | Tattoo Man | tattoo man | Unknown |
| 16 | DOCTOR ZITSOFSKY | doctor zitsofsky | Unknown |
| 20 | Students | students | Unknown |
| 24 | Little Boy | little boy | Unknown |
| 26 | Lewis Clark | lewis clark | Unknown |
| 27 | Little Girl | little girl | Unknown |
| 29 | Bubbles | bubbles | Unknown |
| 30 | Moldy | moldy | Unknown |
| 34 | Ticket Seller | ticket seller | Unknown |
| 35 | Elf #1 | elf 1 | Unknown |
| 36 | Elves | elves | Unknown |
| 37 | Dog’s Owner | dogs owner | Unknown |
| 39 | Kids | kids | Unknown |
| 41 | Conductor | conductor | Unknown |
| 42 | Secretary | secretary | Unknown |
| 46 | Sydney | sydney | Unknown |
| 47 | Cecile Shapiro | cecile shapiro | Unknown |
| 48 | Ian | ian | Unknown |
| 49 | Calvin | calvin | Unknown |
| 50 | Martin Prince, Sr. | martin prince sr | Unknown |
| 51 | Richard | richard | Unknown |
| 53 | Wendell Borton | wendell borton | Unknown |
| 57 | Smilin’ Joe Fission | smilin joe fission | Unknown |
| 58 | Rod #1 | rod 1 | Unknown |
| 59 | Rod #2 | rod 2 | Unknown |
| 60 | RODS | rods | Unknown |
| 61 | Workman #1 | workman 1 | Unknown |
| 62 | Foreman | foreman | Unknown |
| 63 | TERRI & SHERRI | terri sherri | Unknown |
| 64 | PUNK TEENAGER | punk teenager | Unknown |
| 65 | Tv Announcer #1 | tv announcer 1 | Unknown |
| 66 | Tv Announcer #2 | tv announcer 2 | Unknown |
| 67 | Jingle Chorus | jingle chorus | Unknown |
| 68 | Sylvia Winfield | sylvia winfield | Unknown |
| 69 | Old Man Winfield | old man winfield | Unknown |
| 70 | Councilman #1 | councilman 1 | Unknown |
| 72 | Councilman #2 | councilman 2 | Unknown |
| 73 | COUNCILMEN #1/#2 | councilmen 12 | Unknown |
| 74 | Demonstrator #1 | demonstrator 1 | Unknown |
| 75 | Crowd | crowd | Unknown |
| 76 | MR. GAMMILL | mr gammill | Unknown |
| 77 | TOM | tom | Unknown |
| 78 | Mrs. Long | mrs long | Unknown |
| 79 | Wife #1 | wife 1 | Unknown |
| 80 | Wife #2 | wife 2 | Unknown |
| 81 | Other Women | other women | Unknown |
| 82 | Nice Father | nice father | Unknown |
| 83 | Nice Boy | nice boy | Unknown |
Locations
| id | name | normalized_name |
|---|---|---|
| 1 | Street | street |
| 2 | Car | car |
| 3 | Springfield Elementary School | springfield elementary school |
| 4 | Auditorium | auditorium |
| 5 | Simpson Home | simpson home |
| 6 | KITCHEN | kitchen |
| 7 | SHOPPING MALL PARKING LOT | shopping mall parking lot |
| 8 | Springfield Mall | springfield mall |
| 9 | The Happy Sailor Tattoo Parlor | the happy sailor tattoo parlor |
| 10 | Springfield Nuclear Power Plant | springfield nuclear power plant |
| 11 | PLANT | plant |
| 12 | DERMATOLOGY CLINIC | dermatology clinic |
| 13 | Laboratory | laboratory |
| 14 | Circus of Values | circus of values |
| 15 | Moe’s Tavern | moe tavern |
| 16 | Santa School | santa school |
| 17 | Santa’s Workshop | santa workshop |
| 18 | WORKSHOP | workshop |
| 19 | PERSONNEL OFFICE | personnel office |
| 20 | Springfield Downs Dog Track | springfield downs dog track |
| 21 | SPRINGFIELD DOWNS | springfield downs |
| 22 | PADDOCK | paddock |
| 23 | SPRINGFIELD DOWN | springfield down |
| 24 | SPRINGFIELD DOWNS PARKING LOT | springfield downs parking lot |
| 25 | Simpson Living Room | simpson living room |
| 26 | Springfield Elementary School Playground | springfield elementary school playground |
| 27 | CLASSROOM | classroom |
| 28 | Skinner’s Office | skinner office |
| 29 | Homer’s Car | homer car |
| 30 | NEW SCHOOL | new school |
| 31 | Opera House | opera house |
| 32 | OLD SCHOOL | old school |
| 33 | NEW CLASSROOM | new classroom |
| 34 | SCHOOL BUILDING | school building |
| 35 | Simpson Back Porch | simpson back porch |
| 36 | Bus | bus |
| 37 | Road | road |
| 38 | Conference Room | conference room |
| 39 | COFFEE ROOM | coffee room |
| 40 | Bar | bar |
| 41 | Berger’s Burgers | berger burgers |
| 42 | REFRIGERATOR | refrigerator |
| 43 | Bart’s Bedroom | bart bedroom |
| 44 | Simpson Backyard | simpson backyard |
| 45 | Simpson Neighborhood | simpson neighborhood |
| 46 | Master Bedroom | master bedroom |
| 47 | LIVING ROOM | living room |
| 48 | Springfield Town Hall | springfield town hall |
| 49 | CITY COUNCIL CHAMBERS | city council chambers |
| 50 | Park | park |
Episodes
| season | number | episode_id | prod_code | year | title | rating | votes | views | us_views |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 010–110 | 10 | 7G10 | 1,990 | Homer’s Night Out | 7.4 | 1,511 | 50,816 | 30.3 |
| 1 | 012–112 | 12 | 7G12 | 1,990 | Krusty Gets Busted | 8.3 | 1,716 | 62,561 | 30.4 |
| 2 | 014–201 | 14 | 7F03 | 1,990 | Bart Gets an “F” | 8.2 | 1,638 | 59,575 | 33.6 |
| 2 | 017–204 | 17 | 7F01 | 1,990 | Two Cars in Every Garage and Three Eyes on Every Fish | 8.1 | 1,457 | 64,959 | 26.1 |
| 2 | 019–206 | 19 | 7F08 | 1,990 | Dead Putting Society | 8.0 | 1,366 | 50,691 | 25.4 |
| 2 | 021–208 | 21 | 7F06 | 1,990 | Bart the Daredevil | 8.4 | 1,522 | 57,605 | 26.2 |
| 2 | 023–210 | 23 | 7F10 | 1,991 | Bart Gets Hit by a Car | 7.8 | 1,340 | 56,486 | 24.8 |
| 2 | 026–213 | 26 | 7F13 | 1,991 | Homer vs. Lisa and the 8th Commandment | 8.0 | 1,329 | 58,277 | 26.2 |
| 2 | 028–215 | 28 | 7F16 | 1,991 | Oh Brother, Where Art Thou? | 8.2 | 1,413 | 47,426 | 26.8 |
| 2 | 030–217 | 30 | 7F17 | 1,991 | Old Money | 7.6 | 1,243 | 44,331 | 21.2 |
| 2 | 032–219 | 32 | 7F19 | 1,991 | Lisa’s Substitute | 8.5 | 1,684 | 52,770 | 17.7 |
| 2 | 035–222 | 35 | 7F22 | 1,991 | Blood Feud | 8.0 | 1,223 | 52,829 | 17.3 |
| 3 | 037–302 | 37 | 8F01 | 1,991 | Mr. Lisa Goes to Washington | 7.7 | 1,274 | 52,098 | 20.2 |
| 3 | 039–304 | 39 | 8F03 | 1,991 | Bart the Murderer | 8.7 | 1,446 | 64,342 | 20.8 |
| 3 | 041–306 | 41 | 8F05 | 1,991 | Like Father, Like Clown | 7.7 | 1,262 | 45,586 | 20.2 |
| 3 | 044–309 | 44 | 8F07 | 1,991 | Saturdays of Thunder | 7.9 | 1,194 | 55,808 | 24.7 |
| 3 | 046–311 | 46 | 8F09 | 1,991 | Burns Verkaufen der Kraftwerk | 8.2 | 1,291 | 55,987 | 21.1 |
| 3 | 048–313 | 48 | 8F11 | 1,992 | Radio Bart | 8.5 | 1,365 | 58,919 | 24.2 |
| 3 | 051–316 | 51 | 8F16 | 1,992 | Bart the Lover | 8.3 | 1,272 | 53,123 | 20.5 |
| 3 | 053–318 | 53 | 8F15 | 1,992 | Separate Vocations | 8.2 | 1,201 | 61,508 | 23.7 |
| 3 | 055–320 | 55 | 8F19 | 1,992 | Colonel Homer | 7.9 | 1,233 | 46,901 | 25.5 |
| 3 | 058–323 | 58 | 8F22 | 1,992 | Bart’s Friend Falls in Love | 7.8 | 1,160 | 48,058 | 19.5 |
| 4 | 060–401 | 60 | 8F24 | 1,992 | Kamp Krusty | 8.4 | 1,414 | 67,081 | 21.8 |
| 4 | 065–406 | 65 | 9F03 | 1,992 | Itchy & Scratchy: The Movie | 8.2 | 1,293 | 55,740 | 20.1 |
| 4 | 069–410 | 69 | 9F08 | 1,992 | Lisa’s First Word | 8.5 | 1,350 | 62,070 | 28.6 |
| 4 | 072–413 | 72 | 9F11 | 1,993 | Selma’s Choice | 8.0 | 1,153 | 56,396 | 24.5 |
| 1 | 007–107 | 7 | 7G09 | 1,990 | The Call of the Simpsons | 7.9 | 1,638 | 57,793 | 27.6 |
| 2 | 024–211 | 24 | 7F12 | 1,991 | One Fish, Two Fish, Blowfish, Blue Fish | 8.8 | 1,687 | 50,206 | 24.2 |
| 4 | 080–421 | 80 | 9F20 | 1,993 | Marge in Chains | 7.7 | 1,080 | 68,692 | 17.3 |
| 5 | 082–501 | 82 | 9F21 | 1,993 | Homer’s Barbershop Quartet | 8.4 | 1,416 | 58,390 | 19.9 |
| 5 | 084–503 | 84 | 1F02 | 1,993 | Homer Goes to College | 8.6 | 1,476 | 64,802 | 18.1 |
| 5 | 087–506 | 87 | 1F03 | 1,993 | Marge on the Lam | 8.0 | 1,132 | 53,490 | 21.7 |
| 5 | 089–508 | 89 | 1F06 | 1,993 | Boy-Scoutz ’n the Hood | 8.7 | 1,270 | 83,238 | 20.1 |
| 5 | 092–511 | 92 | 1F09 | 1,994 | Homer the Vigilante | 8.2 | 1,202 | 74,673 | 20.1 |
| 5 | 093–512 | 93 | 1F11 | 1,994 | Bart Gets Famous | 8.1 | 1,123 | 66,267 | 20.0 |
| 5 | 095–514 | 95 | 1F12 | 1,994 | Lisa vs. Malibu Stacy | 8.2 | 1,187 | 61,715 | 20.5 |
| 5 | 098–517 | 98 | 1F15 | 1,994 | Bart Gets an Elephant | 7.9 | 1,116 | 63,427 | 17.0 |
| 5 | 102–521 | 102 | 1F21 | 1,994 | Lady Bouvier’s Lover | 7.5 | 1,014 | 59,503 | 15.1 |
| 6 | 104–601 | 104 | 1F22 | 1,994 | Bart of Darkness | 8.6 | 1,330 | 65,126 | 15.1 |
| 6 | 107–604 | 107 | 2F01 | 1,994 | Itchy & Scratchy Land | 8.5 | 1,277 | 72,722 | 14.8 |
| 6 | 111–608 | 111 | 2F05 | 1,994 | Lisa on Ice | 8.4 | 1,236 | 63,564 | 17.9 |
| 6 | 114–611 | 114 | 2F08 | 1,994 | Fear of Flying | 7.8 | 1,100 | 61,569 | 15.6 |
| 6 | 116–613 | 116 | 2F10 | 1,995 | And Maggie Makes Three | 8.5 | 1,284 | 63,051 | 17.3 |
| 6 | 118–615 | 118 | 2F12 | 1,995 | Homie the Clown | 8.5 | 1,254 | 73,123 | 17.6 |
| 6 | 120–617 | 120 | 2F14 | 1,995 | Homer vs. Patty and Selma | 7.9 | 1,006 | 60,599 | 18.9 |
| 6 | 123–620 | 123 | 2F18 | 1,995 | Two Dozen and One Greyhounds | 8.1 | 1,051 | 62,323 | 11.6 |
| 6 | 125–622 | 125 | 2F32 | 1,995 | ’Round Springfield | 8.3 | 1,084 | 56,001 | 12.6 |
| 6 | 127–624 | 127 | 2F22 | 1,995 | Lemon of Troy | 8.6 | 1,285 | 70,698 | 13.1 |
| 7 | 130–702 | 130 | 2F17 | 1,995 | Radioactive Man | 8.3 | 1,172 | 62,390 | 15.7 |
| 7 | 132–704 | 132 | 3F02 | 1,995 | Bart Sells His Soul | 8.7 | 1,354 | 65,333 | 14.8 |
Dialogues
| line_number | episode_id | number | role | location | line |
|---|---|---|---|---|---|
| 1 | 32 | 209 | Miss Hoover | Springfield Elementary School | No, actually, it was a little of both. Sometimes when a disease is in all the magazines and all the news shows, it’s only natural that you think you have it. |
| 2 | 32 | 210 | Lisa Simpson | Springfield Elementary School | Where’s Mr. Bergstrom? |
| 3 | 32 | 211 | Miss Hoover | Springfield Elementary School | I don’t know. Although I’d sure like to talk to him. He didn’t touch my lesson plan. What did he teach you? |
| 4 | 32 | 212 | Lisa Simpson | Springfield Elementary School | That life is worth living. |
| 5 | 32 | 213 | Edna Krabappel-Flanders | Springfield Elementary School | The polls will be open from now until the end of recess. Now, just in case any of you have decided to put any thought into this, we’ll have our final statements. Martin? |
| 6 | 32 | 214 | Martin Prince | Springfield Elementary School | I don’t think there’s anything left to say. |
| 7 | 32 | 215 | Edna Krabappel-Flanders | Springfield Elementary School | Bart? |
| 8 | 32 | 216 | Bart Simpson | Springfield Elementary School | Victory party under the slide! |
| 9 | 32 | 218 | Lisa Simpson | Apartment Building | Mr. Bergstrom! Mr. Bergstrom! |
| 10 | 32 | 219 | Landlady | Apartment Building | Hey, hey, he Moved out this morning. He must have a new job – he took his Copernicus costume. |
| 11 | 32 | 220 | Lisa Simpson | Apartment Building | Do you know where I could find him? |
| 12 | 32 | 221 | Landlady | Apartment Building | I think he’s taking the next train to Capital City. |
| 13 | 32 | 222 | Lisa Simpson | Apartment Building | The train, how like him… traditional, yet environmentally sound. |
| 14 | 32 | 223 | Landlady | Apartment Building | Yes, and it’s been the backbone of our country since Leland Stanford drove that golden spike at Promontory point. |
| 15 | 32 | 224 | Lisa Simpson | Apartment Building | I see he touched you, too. |
| 16 | 32 | 226 | Bart Simpson | Springfield Elementary School | Hey, thanks for your vote, man. |
| 17 | 32 | 227 | Nelson Muntz | Springfield Elementary School | I didn’t vote. Voting’s for geeks. |
| 18 | 32 | 228 | Bart Simpson | Springfield Elementary School | Well, you got that right. Thanks for your vote, girls. |
| 19 | 32 | 229 | Terri/sherri | Springfield Elementary School | We forgot. |
| 20 | 32 | 230 | Bart Simpson | Springfield Elementary School | Well, don’t sweat it. Just so long as a couple of people did… right, Milhouse? |
| 21 | 32 | 231 | Milhouse Van Houten | Springfield Elementary School | Uh oh. |
| 22 | 32 | 232 | Bart Simpson | Springfield Elementary School | Lewis? |
| 23 | 32 | 233 | Bart Simpson | Springfield Elementary School | Somebody must have voted. |
| 24 | 32 | 234 | Milhouse Van Houten | Springfield Elementary School | What about you, Bart? Didn’t you vote? |
| 25 | 32 | 235 | Bart Simpson | Springfield Elementary School | Uh oh. |
| 26 | 32 | 237 | Wendell Borton | Springfield Elementary School | Yayyyyyyyyyyyyyy! |
| 27 | 32 | 238 | Bart Simpson | Springfield Elementary School | I demand a recount. |
| 28 | 32 | 239 | Edna Krabappel-Flanders | Springfield Elementary School | One for Martin, two for Martin. Would you like another recount? |
| 29 | 32 | 240 | Bart Simpson | Springfield Elementary School | No. |
| 30 | 32 | 241 | Edna Krabappel-Flanders | Springfield Elementary School | Well, I just want to make sure. One for Martin. Two for Martin. |
| 31 | 32 | 242 | Kid Reporter | Springfield Elementary School | This way, Mister President! |
| 32 | 32 | 244 | Conductor | Train Station | Now boarding on track 5, The afternoon delight coming to Shelbyville, Parkville, and….. |
| 33 | 32 | 245 | Lisa Simpson | Train Station | Mr. Bergstrom! Hey, Mr. Bergstrom! |
| 34 | 32 | 246 | BERGSTROM | Train Station | Hey, Lisa. |
| 35 | 32 | 247 | Lisa Simpson | Train Station | Hey, Lisa, indeed. |
| 36 | 32 | 248 | BERGSTROM | Train Station | What? What is it? |
| 37 | 32 | 249 | Lisa Simpson | Train Station | Oh, I mean, were you just going to leave, just like that? |
| 38 | 32 | 250 | BERGSTROM | Train Station | Ah, I’m sorry, Lisa. You know, it’s the life of the substitute teacher: he’s a fraud. Today he might be wearing gym shorts, tomorrow he’s speaking French, or, or, or pretending to know how to run a band saw, or God knows what. |
| 39 | 32 | 251 | Lisa Simpson | Train Station | You can’t go! You’re the best teacher I’ll ever have. |
| 40 | 32 | 252 | BERGSTROM | Train Station | Ah, that’s not true. Other teachers will come along who… |
| 41 | 32 | 253 | Lisa Simpson | Train Station | Oh, please. |
| 42 | 32 | 254 | BERGSTROM | Train Station | No, I can’t lie to you, I am the best. But, you know, they need me over in the projects of Capital City. |
| 43 | 32 | 255 | Lisa Simpson | Train Station | But I need you too. |
| 44 | 32 | 256 | BERGSTROM | Train Station | That’s the problem with being middle class. Anybody who really cares will abandon you for those who need it more. |
| 45 | 32 | 257 | Lisa Simpson | Train Station | I, I understand. Mr. Bergstrom, I’m going to miss you. |
| 46 | 32 | 258 | BERGSTROM | Train Station | I’ll tell you what… |
| 47 | 32 | 259 | BERGSTROM | Train Station | Whenever you feel like you’re alone and there’s nobody you can rely on, this is all you need to know. |
| 48 | 32 | 260 | Lisa Simpson | Train Station | Thank you, Mr. Bergstrom. |
| 49 | 32 | 261 | Conductor | Train Station | All aboard! |
| 50 | 32 | 262 | Lisa Simpson | Train Station | So, I guess this is it? It you don’t mind I’ll just run alongside the train as it speeds you from my life? |
Guests
| season | number | prod_code | title | guest_star | role | self |
|---|---|---|---|---|---|---|
| 1 | 002–102 | 7G02 | Bart the Genius | Marcia Wallace | Edna Krabappel-Flanders | FALSE |
| 1 | 002–102 | 7G02 | Bart the Genius | Marcia Wallace | Ms. Melon | FALSE |
| 1 | 003–103 | 7G03 | Homer’s Odyssey | Sam McMurray | Worker | FALSE |
| 1 | 003–103 | 7G03 | Homer’s Odyssey | Marcia Wallace | Edna Krabappel-Flanders | FALSE |
| 1 | 006–106 | 7G06 | Moaning Lisa | Miriam Flynn | Ms. Barr | FALSE |
| 1 | 006–106 | 7G06 | Moaning Lisa | Ron Taylor | Bleeding Gums Murphy | FALSE |
| 1 | 007–107 | 7G09 | The Call of the Simpsons | Albert Brooks | Cowboy Bob | FALSE |
| 1 | 008–108 | 7G07 | The Telltale Head | Marcia Wallace | Edna Krabappel-Flanders | FALSE |
| 1 | 009–109 | 7G11 | Life on the Fast Lane | Albert Brooks | Jacques | FALSE |
| 1 | 010–110 | 7G10 | Homer’s Night Out | Sam McMurray | Gulliver Dark | FALSE |
| 1 | 011–111 | 7G13 | The Crepes of Wrath | Christian Coffinet | Gendarme Officer | FALSE |
| 1 | 012–112 | 7G12 | Krusty Gets Busted | Kelsey Grammer | Sideshow Bob | FALSE |
| 1 | 013–113 | 7G01 | Some Enchanted Evening | June Foray | Babysitter service receptionist | FALSE |
| 1 | 013–113 | 7G01 | Some Enchanted Evening | June Foray | Doofy the Elf | FALSE |
| 1 | 013–113 | 7G01 | Some Enchanted Evening | Penny Marshall | Ms. Botz | FALSE |
| 1 | 013–113 | 7G01 | Some Enchanted Evening | Penny Marshall | Lucille Botzcowski | FALSE |
| 1 | 013–113 | 7G01 | Some Enchanted Evening | Paul Willson | Florist | FALSE |
| 2 | 014–201 | 7F03 | Bart Gets an “F” | Marcia Wallace | Edna Krabappel-Flanders | FALSE |
| 2 | 015–202 | 7F02 | Simpson and Delilah | Harvey Fierstein | Karl | FALSE |
| 2 | 016–203 | 7F04 | Treehouse of Horror | James Earl Jones | Removal man | FALSE |
| 2 | 016–203 | 7F04 | Treehouse of Horror | James Earl Jones | Serak the Preparer | FALSE |
| 2 | 016–203 | 7F04 | Treehouse of Horror | James Earl Jones | Narrator | FALSE |
| 2 | 018–205 | 7F05 | Dancin’ Homer | Tony Bennett | Himself | TRUE |
| 2 | 018–205 | 7F05 | Dancin’ Homer | Daryl Coley | Bleeding Gums Murphy | FALSE |
| 2 | 018–205 | 7F05 | Dancin’ Homer | Ken Levine | Dan Horde | FALSE |
| 2 | 018–205 | 7F05 | Dancin’ Homer | Tom Poston | Capital City Goofball | FALSE |
| 2 | 020–207 | 7F07 | Bart vs. Thanksgiving | Greg Berg | Rory | FALSE |
| 2 | 020–207 | 7F07 | Bart vs. Thanksgiving | Greg Berg | Eddie | FALSE |
| 2 | 020–207 | 7F07 | Bart vs. Thanksgiving | Greg Berg | Radio voice | FALSE |
| 2 | 020–207 | 7F07 | Bart vs. Thanksgiving | Greg Berg | “Hooray for Everything” Announcer | FALSE |
| 2 | 020–207 | 7F07 | Bart vs. Thanksgiving | Greg Berg | Security Man | FALSE |
| 2 | 020–207 | 7F07 | Bart vs. Thanksgiving | Carol Kane | Maggie Simpson | FALSE |
| 2 | 022–209 | 7F09 | Itchy & Scratchy & Marge | Alex Rocco | Roger Meyers Jr. | FALSE |
| 2 | 023–210 | 7F10 | Bart Gets Hit by a Car | Phil Hartman | Lionel Hutz | FALSE |
| 2 | 023–210 | 7F10 | Bart Gets Hit by a Car | Phil Hartman | Heaven | FALSE |
| 2 | 024–211 | 7F11 | One Fish, Two Fish, Blowfish, Blue Fish | Larry King | Himself | TRUE |
| 2 | 024–211 | 7F11 | One Fish, Two Fish, Blowfish, Blue Fish | Joey Miyashima | Toshiro | FALSE |
| 2 | 024–211 | 7F11 | One Fish, Two Fish, Blowfish, Blue Fish | Sab Shimono | Master Sushi Chef | FALSE |
| 2 | 024–211 | 7F11 | One Fish, Two Fish, Blowfish, Blue Fish | George Takei | Akira | FALSE |
| 2 | 024–211 | 7F11 | One Fish, Two Fish, Blowfish, Blue Fish | Diana Tanaka | Hostess | FALSE |
| 2 | 025–212 | 7F12 | The Way We Was | Jon Lovitz | Artie Ziff | FALSE |
| 2 | 025–212 | 7F12 | The Way We Was | Jon Lovitz | Mr. Seckofsky | FALSE |
| 2 | 026–213 | 7F13 | Homer vs. Lisa and the 8th Commandment | Phil Hartman | Troy McClure | FALSE |
| 2 | 026–213 | 7F13 | Homer vs. Lisa and the 8th Commandment | Phil Hartman | Moses | FALSE |
| 2 | 026–213 | 7F13 | Homer vs. Lisa and the 8th Commandment | Phil Hartman | Cable guy | FALSE |
| 2 | 027–214 | 7F15 | Principal Charming | Marcia Wallace | Edna Krabappel-Flanders | FALSE |
| 2 | 028–215 | 7F16 | Oh Brother, Where Art Thou? | Danny DeVito | Herbert Powell | FALSE |
| 2 | 029–216 | 7F14 | Bart’s Dog Gets an F | Tracey Ullman | Emily Winthropp | FALSE |
| 2 | 029–216 | 7F14 | Bart’s Dog Gets an F | Tracey Ullman | Sylvia Winfield | FALSE |
| 2 | 029–216 | 7F14 | Bart’s Dog Gets an F | Frank Welker | Santa’s Little Helper | FALSE |
Let us join into dialogues the information contained in both characters and episodes. This will allow us to know, for instance, the gender of the characters who talk the most, and the season of each episode. Notice that since dialogues ends at episode 568 (episode 16 of season 26), the joined data set also ends at episode 568, even though episodes runs until episode 596 (episode 22 of season 27). Because the join produced NAs for episodes 160, 161, and 173, we manually insert their season number (8).
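The NAs mentioned above are the usual result of a left join when a key has no match in the right-hand table. The toy example below uses made-up data frames (`toy_lines` and `toy_episodes`, not the real data) to sketch how such gaps can be located with `anti_join()` and then patched by hand.

```r
library(dplyr)

# Made-up data for illustration only
toy_lines    <- tibble(episode_id = c(159, 160, 161), line = c("a", "b", "c"))
toy_episodes <- tibble(episode_id = 159, season = 8)

# Episodes without a match in toy_episodes get season = NA
joined <- toy_lines %>%
  left_join(toy_episodes, by = "episode_id")

# List the episode ids that found no match
missing_ids <- toy_lines %>%
  anti_join(toy_episodes, by = "episode_id") %>%
  distinct(episode_id) %>%
  pull(episode_id)

# Fill the gaps manually, here with season 8
patched <- joined %>%
  mutate(season = ifelse(is.na(season), 8, season))
```

Checking the unmatched keys first makes it explicit which rows the manual fix applies to.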
dialogues <- dialogues %>%
left_join(characters, by = c("role" = "name")) %>%
mutate(gender = fct_explicit_na(gender, na_level = "Unknown")) %>%
left_join(episodes[,c("season", "episode_id")], by = "episode_id")
dialogues <- dialogues %>%
mutate(season = ifelse(is.na(season), 8, season)) %>%
select(line_number, season, number = episode_id, location, role, gender, line)
We now join into episodes the information from guests. This will be useful to get, for instance, the guest star name of each episode (if any), and the role they played. Since episodes terminates at episode 596 (last episode of season 27), the joined data set also terminates at episode 596, even if guests terminates at episode 662 (episode 23 of season 30).
episodes <- episodes %>%
left_join(guests, by = c("number", "season", "prod_code", "title")) %>%
mutate(guest = !is.na(role))
Lastly, we tidy the dialogues, which puts the speaking lines into a one-word-per-row format.
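The tidying code is not shown here. A minimal sketch of what such a step could look like, assuming `tidytext::unnest_tokens()` is applied to the `line` column; whether the original analysis also filtered stop words or did any further cleaning is not known, so nothing of the sort is done below. The toy data frame is a stand-in for `dialogues`.

```r
library(dplyr)
library(tidytext)

# Toy stand-in for the `dialogues` data frame (two lines from the table above)
toy_dialogues <- tibble(
  line_number = 1:2,
  role = c("Lisa Simpson", "Bart Simpson"),
  line = c("That life is worth living.", "I demand a recount.")
)

# One row per word: unnest_tokens() lowercases the text, strips punctuation,
# and drops the original `line` column by default
toy_tidy <- toy_dialogues %>%
  unnest_tokens(output = word, input = line)

# Number of words spoken by each role
toy_tidy %>% count(role)
```

In the rest of the report, the tidied data set is referred to as dialogues.tidy and has a word column alongside the original identifiers.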
Characters
The Simpsons is known for its vast ensemble of leading and supporting characters. The characters data set collects the names of 6722 characters that appeared throughout the seasons. For 95% of them, the gender is not recorded. However, this is not problematic, since those characters are not particularly relevant to the development of the show: together they account for only about 15% of all the words spoken in the dialogues.
Frequency distribution
frequency_table <- function(df, group_var, align = NULL, prop = TRUE, head = nrow(df),
caption = ""){
group_var <- enquo(group_var)
col.names <- c(firstup(as_label(group_var)), "Frequency")
table <- df %>%
group_by(!! group_var) %>%
summarize(n = n()) %>%
arrange(desc(n))
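if(prop){
col.names <- c(col.names, "Proportion")
# Rows are already sorted by n. The proportion is formatted as text, so sorting
# on it would be alphabetical (e.g., "9.0%" would come before "63.9%").
table <- table %>%
mutate(prop = percent(n / sum(n)))
}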
table %>%
show_table(col.names = col.names, align = align, head = head, caption = caption)
}
characters %>%
frequency_table(gender, align = c("l", "r", "r"),
caption = "The gender of the Simpsons characters throughout 26 seasons")
| Gender | Frequency | Proportion |
|---|---|---|
| Unknown | 6,399 | 95.2% |
| Male | 252 | 3.7% |
| Female | 71 | 1.1% |
Contributions to dialogues
dialogues.tidy %>%
frequency_table(gender, align = c("l", "r", "r"),
caption = "How each gender contributes to the dialogues of 26 seasons")
| Gender | Frequency | Proportion |
|---|---|---|
| Male | 842,670 | 63.9% |
| Female | 274,362 | 20.8% |
| Unknown | 202,416 | 15.3% |
The Simpsons is characterized by a marked gender imbalance, which is also reflected in the show’s writing staff. More than 75% of the characters with recorded gender are male. The only female leading characters are Marge and Lisa Simpson. Among the supporting cast, the most prominent female characters are Edna Krabappel-Flanders (the teacher at Springfield Elementary School) and Marge’s older sisters, the twins Selma and Patty Bouvier.
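The 75% figure can be checked directly from the gender frequency table of the characters data set, restricting the count to characters whose gender is recorded:

```r
# Counts taken from the gender frequency table of the characters data set
male   <- 252
female <- 71

# Share of male characters among those with a recorded gender
male_share <- male / (male + female)
round(100 * male_share, 1)   # about 78
```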
lollipop <- function(df, x, y, index = NULL, title = "", count = TRUE, ylab = ""){
x <- enquo(x)
y <- enquo(y)
if(count){
df <- df %>%
count(!! x, !! y, sort = TRUE)
}
if(!is.null(index)){
df <- df %>%
slice(index)
}
df %>%
mutate(variable = fct_reorder(!! x, n)) %>%
ggplot(aes(variable, n, fill = !! y)) +
geom_segment(aes(variable, xend = variable, y = 0, yend = n, color = !! y),
size = 0.8, show.legend = FALSE) +
geom_point(aes(color = !! y), size = 4, alpha = 0.6) +
coord_flip() +
labs(x = "", y = ylab, title = title) +
scale_fill_manual(name = firstup(as_label(y)), values = rev(colors[2:3])) +
scale_color_manual(name = firstup(as_label(y)), values = rev(colors[2:3])) +
scale_y_continuous(labels = comma) +
theme(legend.position = "bottom")
}
plot.top.chars <- dialogues.tidy %>%
lollipop(x = role, y = gender, index = c(1:10), title = "Top 10 Simpson leading characters",
ylab = "Number of spoken words in 26 seasons")
plot.supporting.chars <- dialogues.tidy %>%
lollipop(x = role, y = gender, index = c(11:35), title = "The Simpsons supporting cast",
ylab = "Number of spoken words in 26 seasons")
ggarrange(plot.top.chars, plot.supporting.chars, nrow = 1, common.legend = TRUE, legend="bottom",
widths = c(0.95, 1.05))
Locations
The Simpsons is mainly set in Springfield, a fictional town that acts as a complete universe in which the characters can explore the issues faced by modern society. Although the locations data set reports 4459 distinct settings, most of the dialogues actually take place in far fewer places.
The following treemap depicts the 20 most common locations, that is, the settings where the characters had the most dialogue. At the top, we find the Simpson home, followed by Springfield Elementary School and Moe’s Tavern. The majority of these locations (15 out of 20) are indoor settings.
dialogues.tidy %>%
count(location, sort = TRUE) %>%
head(20) %>%
add_column("Space" = c(rep("Indoor", 8), rep("Outdoor", 3), rep("Indoor", 3), "Outdoor",
rep("Indoor", 3), "Outdoor", "Indoor")) %>%
mutate(Space = as.factor(Space)) %>%
ggplot(aes(area = n, label = location, fill = Space, subgroup = location)) +
geom_treemap( alpha = 0.8) +
geom_treemap_subgroup_border(color = "black", size = 0.85) +
geom_treemap_text(place = "centre", size = 13,
grow = FALSE, reflow = TRUE) +
scale_fill_manual(values = c("#62AF67ff", "#989DCDff")) +
theme(legend.position = "bottom") +
labs(title = "Top 20 locations",
subtitle = "Most dialogues take place at the Simpson home, and indoors.")
Let us select the six most recurrent locations and explore which characters have the most dialogue in each.
Simpson home: it is the house of the Simpson family. The most talkative characters here are of course the Simpson family members.
Springfield Elementary School: it is the local school on The Simpsons, attended by Bart and Lisa Simpson. Besides the two of them, other leading characters include the principal Skinner, the superintendent Chalmers and the teacher Edna.
Moe’s Tavern: it is the local bar in Springfield. The dialogue here mostly occurs between the owner Moe and his guests Homer, Lenny, Carl, and Barney.
Springfield Nuclear Power Plant: it is the nuclear power plant in Springfield. The leading characters here are Mr. Burns, who owns the plant, his executive assistant Smithers, and the employees Homer, Lenny, and Carl.
Kwik-E-Mart: it is the convenience store run by Apu. The dialogues here usually involve the owner of the store and the members of the Simpson family.
First Church of Springfield: it is the main religious house in Springfield. Most of the dialogues here occur between the Reverend, the Simpson family, and their very religious next-door neighbor Ned.
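# Note: n below is the number of distinct words used at each location
# (one row per location-word pair), not the total number of words spoken there.
top.locations <- dialogues.tidy %>%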
count(location, word, sort = TRUE) %>%
group_by(location) %>%
summarize(n = n()) %>%
arrange(desc(n)) %>%
top_n(6, n) %>%
pull(location)
set.seed(2)
dialogues.tidy %>%
filter(location %in% top.locations) %>%
count(role, location, sort = TRUE) %>%
group_by(location) %>%
top_n(5, n) %>%
ungroup() %>%
mutate(location = fct_reorder(location, -n, sum)) %>%
ggplot(aes(reorder_within(role, n, location), n, fill = fct_reorder(role, n, sum))) +
geom_col(show.legend = FALSE) +
coord_flip() +
facet_wrap(~location, scales = "free") +
scale_fill_manual(values = sample(palette)) +
labs(title = "Who speaks the most where?", x = "", y = "") +
scale_y_continuous(labels = comma) +
theme(panel.spacing = unit(0.5, "lines")) +
scale_x_reordered()
The following density plots depict the distribution of the total number of words spoken per season by the four Simpson family members Homer, Marge, Bart, and Lisa. Whereas Homer stands out from the other characters, Marge, Bart, and Lisa seem to have had comparable degrees of importance over the seasons.
Top20Chars <- dialogues.tidy %>%
count(role, word, sort = TRUE) %>%
distinct(role) %>%
head(20) %>%
pull(role)
dialogues.tidy %>%
filter(role %in% Top20Chars[1:4]) %>%
group_by(season, role) %>%
summarize(nwords = n()) %>%
ungroup() %>%
mutate(role = reorder(role, -nwords)) %>%
ggplot(aes(x=nwords, fill = fct_rev(role), color = fct_rev(role))) +
geom_density(size = 0.85, alpha = 0.65) +
scale_x_continuous(labels = comma) +
labs(x = "Number of words per season", y = "Density",
title = "Distribution of the total number of words per season") +
guides(fill = guide_legend(reverse = TRUE), color = guide_legend(reverse = TRUE)) +
scale_fill_manual(name = "Character", values = rev(palette[c(8,14,6,1)])) +
scale_color_manual(name = "Character", values = rev(palette[c(8,14,6,1)])) +
theme(legend.position = "bottom")
Episodes
The episodes data set contains information on the IMDb (Internet Movie Database) rating of each episode on a 1 - 10 scale, the number of votes it received, and the number of TV views in the United States.
Let us have a look at the trend of the ratings, the number of votes, and the TV views across episodes and averaged over seasons. We notice an overall downward trend for all three indicators. Older seasons are the most appreciated, most rated, and most viewed. As a matter of fact, The Simpsons received acclaim throughout its first nine or ten seasons, which are generally considered its “Golden Age”, but was criticized for a perceived decline in quality over the years.
Concerning the downward trend observed for TV views though, the availability in more recent years of a variety of TV channels and Internet streaming platforms has caused a drop in the views not only for The Simpsons but also for many other network TV shows.
scatter_plot <- function(df, x, y, xlab = "Original air date", ylab, title,
breaks = NULL, limits = NULL, labels = comma){
x <- enquo(x)
y <- enquo(y)
p <- df %>%
ggplot(aes(!! x, !! y)) +
geom_point(size = 0.6) +
geom_smooth(method = "loess", formula = "y ~ x") +
labs(x = xlab, y = ylab, title = title)
if(!is.null(breaks)){
p <- p +
scale_y_continuous(labels = labels, breaks = breaks, limits = limits)
}else{
p <- p +
scale_y_continuous(labels = labels, limits = limits)
}
p
}
trend.rating <- episodes %>%
scatter_plot(year, rating, ylab = "Rating score", title = "IMDb ratings",
breaks = seq(0, 10, 2), limits = c(1, 10))
trend.votes <- episodes %>%
scatter_plot(year, votes, ylab = "Number of votes", title = "IMDb votes",
limits = c(0, 4000))
trend.views <- episodes %>%
scatter_plot(year, us_views, ylab = "Number of US viewers", title = "TV views in the US",
limits = c(0, 35), labels = unit_format(unit = "", scale = 1e+6, big.mark = ","))
grid.arrange(trend.rating, trend.votes, trend.views, nrow = 1, ncol = 3, widths = c(0.95, 0.97, 1.08))
line_plot <- function(data, x, y, xlab = "Season", ylab, title, sub,
limits = NULL, breaks = NULL, labels = comma){
x <- enquo(x)
y <- enquo(y)
p <- data %>%
ggplot(aes(!! x, !! y)) +
geom_line(size = 1.2, color = "#8d99ae") +
geom_point(shape=21, color=colors[2], fill=colors[2], size=1) +
scale_x_continuous(breaks = seq(1, 27, 4)) +
labs(x = xlab, y = ylab, title = title, subtitle = sub) +
theme(plot.subtitle=element_text(size=9))
if(!is.null(breaks)){
p <- p +
scale_y_continuous(labels = labels, breaks = breaks, limits = limits)
}else{
p <- p +
scale_y_continuous(labels = labels, limits = limits)
}
p
}
episodes_byseason <- episodes %>%
group_by(season) %>%
summarize(avg_rate = mean(rating),
avg_vote = mean(votes),
avg_views = mean(us_views))
plot.rating <- episodes_byseason %>%
line_plot(season, avg_rate, ylab = "Rating score", title = "IMDb ratings",
sub = "Averaged by season \nOlder seasons are the most appreciated.",
breaks = seq(0, 10, 2), limits = c(1,10))
plot.vote <- episodes_byseason %>%
line_plot(season, avg_vote, ylab = "Number of voters", title = "IMDb votes",
sub = "Averaged by season \nOlder seasons are the most rated.",
limits = c(0, 2000))
plot.view <- episodes_byseason %>%
line_plot(season, avg_views, ylab = "Number of viewers", title = "TV views in the US",
sub = "Averaged by season \nOlder seasons are the most viewed.",
labels = unit_format(unit = "", scale = 1e+6, big.mark = ","),
limits = c(0, 30))
grid.arrange(plot.rating, plot.vote, plot.view, nrow = 1,
widths = c(0.95, 0.97, 1.08))
The plots below show the distributions of, and the pairwise relationships between, IMDb ratings, IMDb votes, and TV views in the US. There seems to be a positive linear relationship between each pair of them.
line_smooth <- function(data, mapping, method="loess", ...){
p <- ggplot(data = data, mapping = mapping) +
geom_point(...) +
geom_smooth(method=method)
p
}
bi_density_plot <- function(data, mapping, palette = 4, ...){
p <- ggplot(data, mapping = mapping) +
stat_density_2d(aes(fill = stat(level)), geom = "polygon") +
scale_fill_distiller(palette = palette, direction = 1)
p
}
episodes %>%
select("IMDb rating" = rating, "IMDb votes" = votes, "TV views in US (millions)" = us_views) %>%
ggpairs(upper = list(continuous = wrap(bi_density_plot, palette = 1, size = 0.2)),
diag = list(continuous = wrap("barDiag", fill = "#8d99ae", bins = 27)),
lower = list(continuous = wrap(line_smooth, color = "black", size = 0.2))) +
ggtitle(label = "Pairwise variable comparisons")
The graphical representation of the correlation matrix of episodes shows a strong inverse relationship between the episode number and the ratings, the votes, and the views. This indicates that more recent seasons are generally characterized by a decrease in those performance indicators. These indicators are also strongly and positively correlated with one another, meaning that when one of them increases, the other two tend to increase as well.
episodes %>%
select("Episode number" = episode_id, "IMDb rating" = rating,
"IMDb votes" = votes, "TV views in US" = us_views) %>%
cor() %>%
ggcorrplot(type = "lower", colors = c("#6D9EC1", "white", "#E46726"), outline.col = "white",
legend.title = "Correlation", lab = TRUE, ggtheme = ggplot2::theme_light()) +
labs(title = "Correlation matrix")
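To make the reading of the correlation matrix concrete, here is a minimal sketch on made-up numbers (the `toy` data frame is an assumption for illustration and is not taken from the episodes data). It builds a small Pearson correlation matrix with `cor()` and recomputes one coefficient from its definition, so that a negative value for the episode number and positive values between the indicators can be checked by hand.

```r
# Toy data (made up for illustration, not taken from the episodes data set):
# ratings and views both fall as the episode number grows.
toy <- data.frame(
  episode_id = 1:6,
  rating     = c(8.2, 8.0, 7.6, 7.1, 6.9, 6.5),
  us_views   = c(30, 27, 22, 15, 11, 6)
)

# Pearson correlation matrix, the same call used on the real data
toy_cor <- cor(toy)
print(round(toy_cor, 2))

# The same coefficient for one pair, written out from its definition
x <- toy$episode_id
y <- toy$rating
r_manual <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))
print(r_manual)
```

In this toy setting the coefficient between `episode_id` and `rating` is negative, while the one between `rating` and `us_views` is positive, which mirrors the pattern described for the real data.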
Guest stars
In addition to the show’s regular cast of voice actors, celebrity guest stars have been a staple of The Simpsons since its first season. Guest voices have come from a wide range of professions, including actors, athletes, authors, musicians, artists, politicians, and scientists.
The guests data set contains information about the guest stars that took part in every episode and their role. Let us have a look at the most recurring guest voices and roles.
Frequent guest stars
The most frequent guest stars across 30 seasons are Marcia Wallace, Phil Hartman, and Maurice LaMarche. Whereas Marcia Wallace was almost always playing the teacher Edna Krabappel-Flanders, the other two guest stars actually voiced several characters on the show.
guests %>%
group_by(guest_star) %>%
summarize(unique_roles = paste(unique(role), collapse = '; '),
count = n()) %>%
arrange(desc(count)) %>%
show_table(head = 5, caption = "Top 5 guest stars and the roles they played",
col.names = c("Guest star", "Roles they played", "Number of appearances"),
align = c("l", "l", "r"))
| Guest star | Roles they played | Number of appearances |
|---|---|---|
| Marcia Wallace | Edna Krabappel-Flanders; Ms. Melon; Mrs. Krabapatra | 175 |
| Phil Hartman | Lionel Hutz; Heaven; Troy McClure; Moses; Cable guy; Plato; Joey; Godfather; Horst; Stockbroker; Smooth Jimmy Apollo; Lyle Lanley; Security Guard; Mandy Patinkin; Tom; Eddie Muntz; Evan Conover; Charlton Heston; Fat Tony; Hospital chairman; Bill Clinton | 73 |
| Maurice LaMarche | George C. Scott; Hannibal Lecter; Captain James T. Kirk; Eudora Welty; Commander McBragg; Orson Welles; Recruiter #2; Cap’n Crunch; First Mate Billy; Oceanographer; Farmer; Horn Stuffer; Fox announcer; Government Official; Jock; Toucan Sam; Trix Rabbit; Dwight D. Eisenhower; City Inspector; Nuclear Power Plant Guard; David Starsky; Anthony Hopkins; Charlie Sheen; Prepper; Chef Naziwa; Karl Malden; John Kerry; Milo; Football Commentator; Clive Meriwether; Neil Simon; Rodney Dangerfield; Morbo; Hedonismbot; Lrrr | 38 |
| Joe Mantegna | Fat Tony; Himself playing Fat Tony | 30 |
| Jon Lovitz | Artie Ziff; Mr. Seckofsky; Professor Lombardo; Aristotle Amadopolis; Mr. Devaro; Llewellyn Sinclair; Ms. Sinclair; Jay Sherman; Llewelyn Sinclair; Aristotle Amadopoulis; Enrico Irritazio; Cigarette; Himself; Hacky; Snitchy the Weasel; Rabbi | 28 |
Frequent roles
The most frequent roles played by guest stars are either themselves or some supporting characters, such as the teacher Edna Krabappel-Flanders, the gangster Fat Tony, and the actor Troy McClure.
guests %>%
frequency_table(role, head = 10, prop = FALSE, caption = "Top 10 guest star roles",
align = c("l", "r"))
| Role | Frequency |
|---|---|
| Himself | 336 |
| Edna Krabappel-Flanders | 173 |
| Herself | 59 |
| Fat Tony | 30 |
| Troy McClure | 29 |
| Lionel Hutz | 25 |
| Sideshow Bob | 21 |
| Themselves | 19 |
| Rabbi Hyman Krustofsky | 11 |
| Mona Simpson | 9 |
Let’s have a look at the distribution of the number of guest star appearances over time. In the show’s early years, most guest stars voiced original characters, but as the show has continued, the number of those appearing as themselves has slightly increased.
guests %>%
mutate(self = factor(self, levels = c(FALSE, TRUE),
labels=c("Playing an original character", "Playing themselves"))) %>%
group_by(season, self) %>%
summarize(n = n()) %>%
ggplot(aes(season, n, color = self)) +
geom_line(size = 1.2) +
scale_color_manual(name = "Guest star", values = c("#62AF67ff", "#989DCDff")) +
scale_x_continuous(breaks = seq(1, 30, 2)) +
theme(legend.position = "bottom") +
labs(x = "Season", y = "Number of guest star appearances",
title = "The number of guest star appearances over seasons")
But who are the guest stars who played themselves in multiple episodes? At the top, we find the physicist and cosmologist Stephen Hawking with 4 appearances across 30 seasons, followed by the comic-book writer Stan Lee, the filmmaker Ken Burns, and the actor Gary Coleman with 3 appearances each. The gender imbalance in the original characters is reflected in the guest star appearances, with just 5 women playing themselves twice in 30 seasons.
guests %>%
filter(self) %>%
count(guest_star, sort = TRUE) %>%
filter(n > 1) %>%
add_column(gender = c(rep("Male", 7), "Female", rep("Male", 7), rep("Female", 2), "Male",
"Female", rep("Male", 14), rep("Female", 1), rep("Male", 2))) %>%
lollipop(x = guest_star, y = gender, ylab = "Number of appearances", count = FALSE,
title = "Who has played themselves in multiple Simpsons episodes?")
Let us explore the number of lines generally reserved for guest stars. Because the dialogues data end at season 27, we will consider the guests data only up to that season. As expected, the guest stars with the most lines per episode are the ones voicing a role such as a narrator or an announcer, and they are usually not playing themselves. The only guest star in the top 15 who plays themselves is Lady Gaga.
guests <- guests %>%
mutate(role = ifelse(self, guest_star, role))
guests_summarized <- guests %>%
filter(season <= 27) %>%
group_by(guest_star, role, self) %>%
summarize(nb_episodes = n(),
first_season = min(season),
last_season = max(season))
guest_roles <- guests_summarized %>%
inner_join(dialogues %>%
count(role, sort = TRUE, name = "nb_lines"),
by = "role") %>%
mutate(lines_per_episode = nb_lines/ nb_episodes)
guest_roles %>%
arrange(desc(lines_per_episode)) %>%
show_table(head = 15, col.names = c("Guest star", "Role", "Playing themselves",
"Number of episodes", "First season", "Last season",
"Number of lines", "Number of lines per episode"),
align = c("l", "l", rep("r", 6)))
| Guest star | Role | Playing themselves | Number of episodes | First season | Last season | Number of lines | Number of lines per episode |
|---|---|---|---|---|---|---|---|
| Larry McKay | Announcer | FALSE | 1 | 3 | 3 | 386 | 386 |
| Matt Groening | Announcer | FALSE | 1 | 23 | 23 | 386 | 386 |
| Phil Hartman | Fat Tony | FALSE | 1 | 7 | 7 | 276 | 276 |
| Clarence Clemons | Narrator | FALSE | 1 | 11 | 11 | 156 | 156 |
| Daniel Stern | Narrator | FALSE | 1 | 2 | 2 | 156 | 156 |
| George Fenneman | Narrator | FALSE | 1 | 5 | 5 | 156 | 156 |
| Jim Forbes | Narrator | FALSE | 1 | 11 | 11 | 156 | 156 |
| Ken Burns | Narrator | FALSE | 1 | 24 | 24 | 156 | 156 |
| Marc Wilmore | Narrator | FALSE | 1 | 26 | 26 | 156 | 156 |
| Matt Dillon | Louie | FALSE | 1 | 19 | 19 | 104 | 104 |
| Greg Berg | Eddie | FALSE | 1 | 2 | 2 | 96 | 96 |
| James Earl Jones | Narrator | FALSE | 2 | 2 | 9 | 156 | 78 |
| Lady Gaga | Lady Gaga | TRUE | 1 | 23 | 23 | 78 | 78 |
| Kristen Wiig | Annie Crawford | FALSE | 1 | 25 | 25 | 74 | 74 |
| Steve Carell | Dan Gillick | FALSE | 1 | 24 | 24 | 64 | 64 |
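When reading the table above, note that several guest stars share exactly the same line count for generic roles such as 'Narrator' or 'Announcer'. A likely reason, sketched below on made-up data, is that the join is keyed on `role` alone, so every guest who voiced a role with that name is credited with all the lines spoken under it across the show. The names `toy_guests` and `toy_dialogues` are hypothetical and only mimic the shape of the real tables.

```r
library(dplyr)

# Two hypothetical guests who voiced a role with the same generic name,
# each in a single, different episode
toy_guests <- tibble(
  guest_star  = c("Guest A", "Guest B"),
  role        = c("Narrator", "Narrator"),
  nb_episodes = c(1, 1)
)

# Hypothetical script lines: 3 narrator lines in episode 1, 2 in episode 2
toy_dialogues <- tibble(
  episode_id = c(1, 1, 1, 2, 2),
  role       = rep("Narrator", 5)
)

# Same pattern as in the analysis: count lines per role, then join on role
toy_roles <- toy_guests %>%
  inner_join(count(toy_dialogues, role, name = "nb_lines"), by = "role") %>%
  mutate(lines_per_episode = nb_lines / nb_episodes)

print(toy_roles)
```

Both toy guests end up with 5 lines per episode, although they spoke 3 and 2 lines respectively. Figures for generic roles in the table are therefore best read as upper bounds; joining on the episode as well as the role would be one way to separate them, assuming both tables carry an episode identifier.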
Many guest stars appeared in only one episode, and the distribution of the number of their lines per episode is positively skewed. Also, guest stars playing themselves tend to have fewer lines than those playing an actual character on the show.
guest_roles %>%
mutate(self = ifelse(self, "Playing Themselves", "Playing a Character")) %>%
ggplot(aes(lines_per_episode)) +
geom_histogram(aes(fill = self), binwidth = 2, center = 1, show.legend = FALSE) +
facet_wrap(~ self, ncol = 2) +
scale_fill_manual(values = c("#62AF67ff", "#989DCDff")) +
labs(x = "Average number of lines per episode", y = "Frequency",
subtitle = "Most guest stars, especially those playing themselves, have relatively few lines per episode")
Let us compare the ratings, votes, and views of the episodes with guest stars playing themselves versus episodes without any guest star. It seems that episodes with guest stars playing themselves have, on average, slightly lower IMDb ratings, IMDb votes, and TV views than the episodes without any guest star.
episodes %>%
filter(is.na(self) | self == TRUE) %>%
group_by(guest) %>%
summarize(avg_rating = mean(rating),
avg_votes = mean(votes),
avg_views = mean(us_views)) %>%
mutate(guest = as.factor(guest),
guest = fct_recode(guest, `Playing themselves` = "TRUE",
`Absent` = "FALSE")) %>%
show_table(col.names = c("Guest star", "IMDb rating",
"IMDb votes", "TV views in US (millions)"),
align = c("l", rep("r", 3)), caption = "Average performance indicators")
| Guest star | IMDb rating | IMDb votes | TV views in US (millions) |
|---|---|---|---|
| Absent | 7.39 | 853.31 | 12.25 |
| Playing themselves | 7.28 | 777.82 | 11.22 |
plot_violin <- function(df, x, y, z, ylab, title = "", limits = NULL,
breaks = NULL, labels = comma){
x <- enquo(x)
y <- enquo(y)
z <- enquo(z)
data_summary <- function(x) {
m <- mean(x)
ymin <- m-sd(x)
ymax <- m+sd(x)
return(c(y=m,ymin=ymin,ymax=ymax))
}
p <- df %>%
filter(is.na(!! x) | !! x == TRUE) %>%
mutate(y = as.factor(!! y)) %>%
group_by(y) %>%
ggplot(aes(y, !! z, fill = y)) +
geom_violin() +
scale_fill_manual(name = "Guest star", values = c("#fcbf49", "#989DCDff"),
labels = c("FALSE" = "Absent", "TRUE" = "Playing themselves")) +
scale_x_discrete(labels = c("FALSE" = "", "TRUE" = "")) +
labs(x = "", y = ylab, title = title) +
stat_summary(fun.data = data_summary, geom = "pointrange", color = "black",
show.legend = FALSE)
if(!is.null(breaks)){
p <- p +
scale_y_continuous(labels = labels, breaks = breaks, limits = limits)
}else{
p <- p +
scale_y_continuous(labels = labels, limits = limits)
}
p
}
plot.guest.rating <- episodes %>%
plot_violin(x = self, y = guest, z = rating, ylab = "Rating", title = "IMDb ratings",
limits = c(0, 10), breaks = seq(0, 10, 2))
plot.guest.vote <- episodes %>%
plot_violin(x = self, y = guest, z = votes, ylab = "Votes", title = "IMDb votes",
labels = comma, limits = c(0, 4000))
plot.guest.views <- episodes %>%
plot_violin(x = self, y = guest, z = us_views, ylab = "Views", title = "TV views in the US",
labels = unit_format(unit = "", scale = 1e+6, big.mark = ","),
limits = c(0, 35))
ggarrange(plot.guest.rating, plot.guest.vote, plot.guest.views, nrow = 1, common.legend = TRUE,
legend="bottom", widths = c(0.95, 0.97, 1.08))
Text analysis
Let us now carry out some text analysis on dialogues. In this scenario, the episodes of the show are acting as the documents of the corpus.
Word and document frequency
Let us look at the most frequent words. By keeping only distinct combinations of role, line number, and word, we avoid counting the same word more than once within the same line. The most recurrent words, after removing the stop words, seem to be related to characters addressing each other.
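The effect of the `distinct()` step can be seen on a toy example. The sketch below uses a made-up token table (`toy_tokens` is an assumption, shaped like dialogues.tidy with one row per word) in which the same word appears twice in one line, and compares a raw count with the line-level count.

```r
library(dplyr)

# Toy tokens (made up): line 1 contains "marge" twice
toy_tokens <- tibble(
  role        = c("Homer", "Homer", "Homer", "Homer"),
  line_number = c(1, 1, 1, 2),
  word        = c("marge", "marge", "beer", "marge")
)

# Raw count: every occurrence is counted
raw_counts <- toy_tokens %>%
  count(role, word, sort = TRUE)

# Line-level count: a word counts at most once per line
line_counts <- toy_tokens %>%
  distinct(role, line_number, word) %>%
  count(role, word, sort = TRUE)

print(raw_counts)
print(line_counts)
```

Here 'marge' is counted 3 times in the raw table but only twice after `distinct()`, once for each line in which it occurs.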
dialogues.tidy <- dialogues.tidy %>%
anti_join(stop_words, by = "word")
dialogues.summarized <- dialogues.tidy %>%
distinct(role, line_number, word) %>%
count(role, word, sort = TRUE)
dialogues.summarized %>%
show_table(head = 15, col.names = c("Character", "Word", "Frequency"),
align = c("l", "r", "r"))
| Character | Word | Frequency |
|---|---|---|
| Homer Simpson | marge | 1,752 |
| Marge Simpson | homer | 1,319 |
| Lisa Simpson | dad | 1,076 |
| Homer Simpson | hey | 933 |
| Bart Simpson | dad | 876 |
| Lisa Simpson | bart | 708 |
| Homer Simpson | gonna | 694 |
| Homer Simpson | yeah | 691 |
| Bart Simpson | hey | 662 |
| Lisa Simpson | mom | 612 |
| Homer Simpson | uh | 607 |
| Homer Simpson | boy | 583 |
| Marge Simpson | bart | 570 |
| Homer Simpson | time | 558 |
| Marge Simpson | homie | 525 |
We can now compute the term frequency (tf), the inverse document frequency (idf), and the tf-idf. The latter looks for the most important words in each document that are not too common in other documents. In our case, this means finding the words that are peculiar to a particular character, but generally not to other characters.
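As a small worked example of these quantities, the sketch below applies `bind_tf_idf()` from tidytext to made-up counts (the `toy_counts` table and its numbers are assumptions for illustration). Each character plays the part of a document: tf is a word's share of that character's words, idf is the natural log of the number of characters divided by the number of characters using the word, and tf-idf is their product.

```r
library(dplyr)
library(tidytext)

# Toy counts (made up): how often each character says each word
toy_counts <- tibble(
  role = c("Homer", "Homer", "Marge", "Marge", "Nelson", "Nelson"),
  word = c("hey",   "doh",   "hey",   "homie", "hey",    "haw"),
  n    = c(5,       5,       5,       5,       2,        8)
)

# Same call pattern as in the analysis: term, document, count
toy_tf_idf <- toy_counts %>%
  bind_tf_idf(word, role, n) %>%
  arrange(desc(tf_idf))

print(toy_tf_idf)
```

The word 'hey' is used by all three toy characters, so its idf is log(3/3) = 0 and its tf-idf vanishes however often it is said. 'haw' is used by one character only and makes up 80% of that character's words, so it gets the highest score, 0.8 * log(3).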
role.specificity <- dialogues.summarized %>%
group_by(role) %>%
mutate(total_words = sum(n)) %>%
ungroup() %>%
bind_tf_idf(word, role, n) %>%
arrange(desc(tf_idf))
The tf-idf can be used as a catchphrase detector. We look at the characters with a fair amount of dialogue (i.e., at least 500 words) and keep one row for each character, so as to find the single most peculiar word for every role. The plot below shows the words and characters with the highest tf-idf. For some characters, the peculiar word is the name of the character they usually talk to (e.g., Smithers saying ‘sir’ or Agnes Skinner saying ‘Seymour’). For others, it is either the word they use to introduce themselves (e.g., Troy McClure saying ‘I’m Troy McClure’) or a recurring sound (e.g., the Captain going ‘Arrr’ or Nelson ‘haw’).
role.specificity %>%
filter(total_words >= 500) %>%
distinct(role, .keep_all = TRUE) %>%
mutate(role_word = paste0(role, ": ", word)) %>%
head(20) %>%
mutate(role_word = fct_reorder(role_word, tf_idf)) %>%
ggplot(aes(role_word, tf_idf)) +
geom_segment(aes(role_word, xend = role_word, y = 0, yend = tf_idf),
color = "#8d99ae", size = 0.8, show.legend = FALSE) +
geom_point(color = "#8d99ae", size = 4, alpha = 0.6) +
coord_flip() +
labs(title = "Using TF-IDF as a catchphrase detector",
subtitle = "Top 20 characters speaking at least 500 words in 27 seasons.",
x = "", y = "TF-IDF")
Bigrams Analysis
Let us now focus on the bigrams, that is, the pairs of words that often occur together.
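For readers new to the idea, here is a minimal sketch of how bigrams are extracted, using `unnest_tokens()` from tidytext on two made-up lines (the `toy_lines` table is an assumption for illustration). Each line is lowercased, stripped of punctuation, and split into overlapping pairs of consecutive words.

```r
library(dplyr)
library(tidytext)

# Two made-up lines of dialogue
toy_lines <- tibble(
  role = c("Homer", "Nelson"),
  line = c("Woo hoo! Free donuts", "Haw haw!")
)

# One row per pair of consecutive words within each line
toy_bigrams <- toy_lines %>%
  unnest_tokens(bigram, line, token = "ngrams", n = 2)

print(toy_bigrams)
```

A line of four words yields three bigrams ('woo hoo', 'hoo free', 'free donuts'), and a line of two words yields one. Since punctuation is dropped before pairing, a bigram can span a sentence boundary, as 'hoo free' does here.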
Frequent bigrams
The most recurrent bigrams concern the names of the Simpson family (e.g., ‘homer simpson’, ‘bart simpson’, ‘lisa simpson’) or some onomatopoeia (e.g., ‘woo hoo’, ‘hey hey’, ‘la la’).
dialogue_bigram <- dialogues %>%
unnest_tokens(bigram, line, token = "ngrams", n = 2)
dialogue_filtered <- dialogue_bigram %>%
separate(bigram, c("word1", "word2"), sep = " ") %>%
filter(!word1 %in% stop_words$word & !is.na(word1)) %>%
filter(!word2 %in% stop_words$word & !is.na(word2))
bigram_counts <- dialogue_filtered %>%
count(word1, word2, sort = TRUE)
bigram_united <- dialogue_filtered %>%
unite(bigram, word1, word2, sep = " ")
bigram_united %>%
count(bigram, sort = TRUE) %>%
show_table(head = 10, col.names = c("Bigram", "Frequency"),
caption = "Top 10 bigrams throughout 26 seasons", align = c("l", "r"))
| Bigram | Frequency |
|---|---|
| homer simpson | 461 |
| woo hoo | 360 |
| hey hey | 311 |
| la la | 268 |
| bart simpson | 258 |
| heh heh | 221 |
| ha ha | 215 |
| uh huh | 210 |
| haw haw | 184 |
| lisa simpson | 174 |
Peculiar bigrams
The peculiar bigrams are those with the largest tf-idf among the bigrams that a character said over 50 times. At the top positions, we find Nelson’s signature mocking laugh ‘haw haw’, and ‘kent brockman’, since the TV anchor usually opens with ‘This is Kent Brockman’.
bigram_tf_idf <- bigram_united %>%
count(role, bigram) %>%
bind_tf_idf(bigram, role, n) %>%
arrange(desc(tf_idf))
bigram_tf_idf %>%
filter(n > 50) %>%
distinct(role, .keep_all = TRUE) %>%
show_table(col.names = c("Character", "Bigram", "Frequency", "tf", "idf", "tf-idf"),
align = c("l", rep("r", 5)),
caption = "Peculiar bigrams occurring over 50 times throughout 26 seasons")
| Character | Bigram | Frequency | tf | idf | tf-idf |
|---|---|---|---|---|---|
| Nelson Muntz | haw haw | 133 | 0.13 | 4.83 | 0.62 |
| Kent Brockman | kent brockman | 70 | 0.02 | 5.45 | 0.13 |
| Krusty the Clown | hey hey | 80 | 0.03 | 4.09 | 0.13 |
| Moe Szyslak | hey hey | 57 | 0.02 | 4.09 | 0.06 |
| Homer Simpson | woo hoo | 311 | 0.01 | 5.15 | 0.06 |
| Bart Simpson | hey dad | 55 | 0.00 | 6.50 | 0.03 |
Networks
The relationships across the bigrams can be depicted through a network plot. To keep the plot readable, we consider only bigrams that occurred more than 30 times. The nodes from which the most arrows depart seem to be ‘simpson’, ‘dollars’, and ‘hoo’. The most common bigrams refer to character names, locations, or onomatopoeia.
bigram.graph <- bigram_counts %>%
filter(n > 30) %>%
graph_from_data_frame()
set.seed(1234)
a <- grid::arrow(type = "closed", length = unit(.15, "inches"))
ggraph(bigram.graph, layout = "fr") +
geom_edge_link(aes(edge_alpha = n), show.legend = FALSE,
arrow = a, end_cap = circle(.07, "inches")) +
geom_node_point(color = "lightblue", size = 5) +
geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
theme_void()
Sentiment Analysis
We can carry out a sentiment analysis to explore the feelings that emerge from the Simpsons dialogues. We are using the ‘bing’ lexicon, which attributes a positive or negative valence to each word in its vocabulary.
We now investigate which words, among those occurring more than 400 times, contribute most to the positive and negative sentiments. Among the positive words, we find e.g., ‘love’, ‘wow’, ‘nice’, and ‘fine’, whereas among the negative ones we find ‘bad’, ‘burns’, ‘stupid’, and ‘kill’. The word ‘burns’ is associated with a negative sentiment because the lexicon treats it as a form of the verb ‘to burn’, which clearly has a negative connotation. In the Simpsons case though, the word ‘burns’ is likely to refer to the character called Mr. Burns. However, due to the evil and greedy nature of the character himself, the graph seems pretty accurate after all!
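The lexicon lookup itself is a plain join on the word column. The sketch below illustrates it with a hypothetical four-word mini-lexicon (`toy_lexicon`, made up here in the same two-column shape as the ‘bing’ lexicon) and a made-up token table, so it runs without downloading anything.

```r
library(dplyr)

# Hypothetical mini-lexicon with the same columns as get_sentiments("bing")
toy_lexicon <- tibble(
  word      = c("love", "nice", "bad", "burns"),
  sentiment = c("positive", "positive", "negative", "negative")
)

# Made-up tokens, one row per spoken word
toy_words <- tibble(
  role = c("Homer", "Homer", "Smithers", "Smithers", "Lisa"),
  word = c("love", "donut", "burns", "nice", "bad")
)

# Words absent from the lexicon (here 'donut') are dropped by the inner join
toy_sentiments <- toy_words %>%
  inner_join(toy_lexicon, by = "word")

print(count(toy_sentiments, sentiment, word))
```

Because the match is purely lexical, a proper name that coincides with a lexicon entry, such as ‘burns’, is scored like the ordinary word. Filtering such names out before the join would be one way to handle this, if a cleaner signal were needed.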
simpsons_sentiments <- dialogues.tidy %>%
inner_join(get_sentiments("bing"), by = "word")
bar_chart_sentiment <- function(df, x = NULL, y, z, slice = NULL, count = TRUE){
y <- enquo(y)
z <- enquo(z)
if(count){
x <- enquo(x)
df <- df %>%
count(!! x, !! y, !! z) %>%
mutate(n = ifelse(!! z == "negative", -n, n),
!! x := reorder(!! x, -abs(n), sum),
!! y := reorder_within(!! y, n, !! x)) %>%
group_by(!! x) %>%
arrange(desc(abs(n)))
}
if(!is.null(slice)){
df <- df %>%
slice(slice)
}
df %>%
ggplot(aes(!! y, n, fill = !! z)) +
geom_bar(stat = "identity") +
coord_flip() +
scale_fill_manual(name = "Sentiment", values = c("#f1776a", "#8bc384")) +
labs(x = "Word", y = "Contribution to sentiment") +
scale_x_reordered() +
theme(legend.position = "bottom")
}
simpsons_sentiments %>%
count(sentiment, word) %>%
ungroup() %>%
filter(n > 400) %>%
mutate(n = ifelse(sentiment == "negative", -n, n)) %>%
mutate(word = reorder(word, n)) %>%
bar_chart_sentiment(y = word, z = sentiment, count = FALSE)
The words that contribute most to the positive (e.g., ‘love’, ‘nice’, ‘fine’) and negative sentiments (e.g., ‘bad’, ‘wrong’) do not seem to be character-dependent.
simpsons_sentiments %>%
filter(role %in% Top20Chars[1:6]) %>%
bar_chart_sentiment(x = role, y = word, z = sentiment, slice = c(1:12)) +
facet_wrap(~ role, scales = "free")